{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.026664,
     "end_time": "2020-06-23T18:41:12.994224",
     "exception": false,
     "start_time": "2020-06-23T18:41:12.967560",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Exploratory Data Analysis of NOAA Weather Data \n",
    "\n",
    "This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/).\n",
    "\n",
    "In this notebook we visualize and analyze the weather time-series dataset.\n",
    "\n",
    "### Table of Contents:\n",
    "* [1. Read the Cleaned Data](#cell1)\n",
    "* [2. Visualize the Data](#cell2)\n",
    "* [3. Analyze Trends in the Data](#cell3)\n",
    "* [Authors](#authors)\n",
    "\n",
    "#### Import required packages\n",
    "\n",
    "Install and import the required packages:\n",
    "\n",
    "* pandas\n",
    "* matplotlib\n",
    "* seaborn\n",
    "* numpy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:13.057341Z",
     "iopub.status.busy": "2020-06-23T18:41:13.055947Z",
     "iopub.status.idle": "2020-06-23T18:41:39.978213Z",
     "shell.execute_reply": "2020-06-23T18:41:39.977136Z"
    },
    "papermill": {
     "duration": 26.955318,
     "end_time": "2020-06-23T18:41:39.978610",
     "exception": false,
     "start_time": "2020-06-23T18:41:13.023292",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Installing packages needed for data processing and visualization\n",
    "!pip install pandas matplotlib seaborn numpy "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:40.369313Z",
     "iopub.status.busy": "2020-06-23T18:41:40.367774Z",
     "iopub.status.idle": "2020-06-23T18:41:44.347540Z",
     "shell.execute_reply": "2020-06-23T18:41:44.348452Z"
    },
    "papermill": {
     "duration": 4.170177,
     "end_time": "2020-06-23T18:41:44.348890",
     "exception": false,
     "start_time": "2020-06-23T18:41:40.178713",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Importing the packages\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from pandas import DataFrame as df\n",
    "from matplotlib import pyplot as plt\n",
    "plt.rcParams['figure.dpi'] = 160"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.128767,
     "end_time": "2020-06-23T18:41:44.616934",
     "exception": false,
     "start_time": "2020-06-23T18:41:44.488167",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell1\"></a>\n",
    "\n",
    "### 1. Read the Cleaned Data\n",
    "\n",
    "We start by reading in the cleaned dataset that was created in notebook `Part 1 - Data Cleaning`. \n",
    "\n",
    "*Note* if you haven't yet run that notebook, do that first otherwise the cells below will not work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:44.870589Z",
     "iopub.status.busy": "2020-06-23T18:41:44.869715Z",
     "iopub.status.idle": "2020-06-23T18:41:47.222803Z",
     "shell.execute_reply": "2020-06-23T18:41:47.223766Z"
    },
    "papermill": {
     "duration": 2.487078,
     "end_time": "2020-06-23T18:41:47.224072",
     "exception": false,
     "start_time": "2020-06-23T18:41:44.736994",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv', parse_dates=['DATE'])\n",
    "# Set date index\n",
    "data = data.set_index(pd.DatetimeIndex(data['DATE']))\n",
    "data.drop(['DATE'], axis=1, inplace=True)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.116207,
     "end_time": "2020-06-23T18:41:47.455659",
     "exception": false,
     "start_time": "2020-06-23T18:41:47.339452",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell2\"></a>\n",
    "\n",
    "### 2. Visualize the Data\n",
    "\n",
    "In this section we visualize a few sections of the data, using `matplotlib`'s `pyplot` module. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:47.682980Z",
     "iopub.status.busy": "2020-06-23T18:41:47.681927Z",
     "iopub.status.idle": "2020-06-23T18:41:47.689885Z",
     "shell.execute_reply": "2020-06-23T18:41:47.688423Z"
    },
    "papermill": {
     "duration": 0.119332,
     "end_time": "2020-06-23T18:41:47.690109",
     "exception": false,
     "start_time": "2020-06-23T18:41:47.570777",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Columns to visualize\n",
    "plot_cols = ['dry_bulb_temp_f', 'relative_humidity', 'wind_speed', 'station_pressure', 'precip']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.087035,
     "end_time": "2020-06-23T18:41:47.866770",
     "exception": false,
     "start_time": "2020-06-23T18:41:47.779735",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### Quick Peek at the Data\n",
    "\n",
    "We first visualize all the data we have to get a rough idea about how the data looks like. \n",
    "\n",
    "As we can see in the plot below, the hourly temperatures follow a clear seasonal trend. Wind speed, pressure, humidity and precipitation data seem to have much higher variance and randomness.\n",
    "\n",
    "It might be more meaningful to make a model to predict temperature, rather than some of the other more noisy data columns. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:48.119716Z",
     "iopub.status.busy": "2020-06-23T18:41:48.117853Z",
     "iopub.status.idle": "2020-06-23T18:41:52.078532Z",
     "shell.execute_reply": "2020-06-23T18:41:52.079334Z"
    },
    "papermill": {
     "duration": 4.109965,
     "end_time": "2020-06-23T18:41:52.079730",
     "exception": false,
     "start_time": "2020-06-23T18:41:47.969765",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Quick overview of columns\n",
    "plt.figure(figsize=(30, 12))\n",
    "i = 1\n",
    "for col in plot_cols:\n",
    "    plt.subplot(len(plot_cols), 1, i)\n",
    "    plt.plot(data[col].values)\n",
    "    plt.title(col)\n",
    "    i += 1\n",
    "plt.subplots_adjust(hspace=0.5)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.140846,
     "end_time": "2020-06-23T18:41:52.416480",
     "exception": false,
     "start_time": "2020-06-23T18:41:52.275634",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### Feature Dependencies\n",
    "\n",
    "Now we explore how the features (columns) of our data are related to each other. This helps in deciding which features to use when modelling a classifier or regresser. \n",
    "We ideally want independent features to be classified independently and likewise dependent features to be contributing to the same model. \n",
    "\n",
    "We can see from the correlation plots how some features are somewhat correlated and could be used as additional data (perhaps for augmenting) when training a classifier. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:52.778282Z",
     "iopub.status.busy": "2020-06-23T18:41:52.776521Z",
     "iopub.status.idle": "2020-06-23T18:41:53.792117Z",
     "shell.execute_reply": "2020-06-23T18:41:53.786362Z"
    },
    "papermill": {
     "duration": 1.164636,
     "end_time": "2020-06-23T18:41:53.792419",
     "exception": false,
     "start_time": "2020-06-23T18:41:52.627783",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Plot correlation matrix\n",
    "f, ax = plt.subplots(figsize=(7, 7))\n",
    "corr = data[plot_cols].corr()\n",
    "sns.heatmap(corr, mask=np.zeros_like(corr, dtype=np.bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True, ax=ax);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.15953,
     "end_time": "2020-06-23T18:41:54.149759",
     "exception": false,
     "start_time": "2020-06-23T18:41:53.990229",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Additionally we also visualize the joint distrubitions in the form of pairplots/scatter plots to see (qualitatively) the way in which these features are related in more detail over just the correlation.\n",
    "They are essentially 2D joint distributions in the case of off-diagonal subplots and the histogram (an approximation to the probability distribution) in case of the diagonal subplots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:41:54.421853Z",
     "iopub.status.busy": "2020-06-23T18:41:54.420601Z",
     "iopub.status.idle": "2020-06-23T18:42:22.529784Z",
     "shell.execute_reply": "2020-06-23T18:42:22.531481Z"
    },
    "papermill": {
     "duration": 28.253543,
     "end_time": "2020-06-23T18:42:22.531950",
     "exception": false,
     "start_time": "2020-06-23T18:41:54.278407",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Plot pairplots\n",
    "sns.pairplot(data[plot_cols])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.211878,
     "end_time": "2020-06-23T18:42:23.041456",
     "exception": false,
     "start_time": "2020-06-23T18:42:22.829578",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell3\"></a>\n",
    "\n",
    "### 3. Analyze Trends in the Data\n",
    "\n",
    "Now that we have explored the whole dataset and the features on a high level, let us focus on one particular feature - `dry_bulb_temp_f`, the dry bulb temperature in degrees Fahrenheit. This is what we mean when we refer to \"air temperature\". This is the most common feature used in temperature prediction, and here we explore it in further detail. \n",
    "\n",
    "We first start with plotting the data for all 9 years in monthly buckets then drill down to a single year to notice (qualitatively) the overall trend in the data. We can see from the plots that every year has roughly a sinousoidal nature to the temperature with some anomalies around 2013-2014. Upon further drilling down we see that each year's data is not the smooth sinousoid but rather a jagged and noisy one. But the overall trend still is a sinousoid. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:42:23.463994Z",
     "iopub.status.busy": "2020-06-23T18:42:23.463174Z",
     "iopub.status.idle": "2020-06-23T18:42:25.738259Z",
     "shell.execute_reply": "2020-06-23T18:42:25.735899Z"
    },
    "papermill": {
     "duration": 2.487948,
     "end_time": "2020-06-23T18:42:25.738514",
     "exception": false,
     "start_time": "2020-06-23T18:42:23.250566",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15,7))\n",
    "\n",
    "TEMP_COL = 'dry_bulb_temp_f'\n",
    "# Plot temperature data converted to a monthly frequency \n",
    "plt.subplot(1, 2, 1)\n",
    "data[TEMP_COL].asfreq('M').plot()  \n",
    "plt.title('Monthly Temperature')\n",
    "plt.ylabel('Temperature (F)')\n",
    "\n",
    "# Zoom in on a year and plot temperature data converted to a daily frequency \n",
    "plt.subplot(1, 2, 2)\n",
    "data['2017'][TEMP_COL].asfreq('D').plot()  \n",
    "plt.title('Daily Temperature in 2017')\n",
    "plt.ylabel('Temperature (F)')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.279776,
     "end_time": "2020-06-23T18:42:26.263290",
     "exception": false,
     "start_time": "2020-06-23T18:42:25.983514",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Next, we plot the change (delta) in temperature and notice that it is lowest around the middle of the year. That is expected behaviour as the gradient of the sinousoid near it's peak is zero. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:42:26.788207Z",
     "iopub.status.busy": "2020-06-23T18:42:26.787291Z",
     "iopub.status.idle": "2020-06-23T18:42:29.121789Z",
     "shell.execute_reply": "2020-06-23T18:42:29.123377Z"
    },
    "papermill": {
     "duration": 2.611371,
     "end_time": "2020-06-23T18:42:29.123704",
     "exception": false,
     "start_time": "2020-06-23T18:42:26.512333",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15,7))\n",
    "\n",
    "# Plot percent change of daily temperature in 2017\n",
    "plt.subplot(1, 2, 1)\n",
    "data['2017'][TEMP_COL].asfreq('D').div(data['2017'][TEMP_COL].asfreq('D').shift()).plot()\n",
    "plt.title('% Change in Daily Temperature in 2017')\n",
    "plt.ylabel('% Change')\n",
    "\n",
    "# Plot absolute change of temperature in 2017 with daily frequency\n",
    "plt.subplot(1, 2, 2)\n",
    "data['2017'][TEMP_COL].asfreq('D').diff().plot()\n",
    "plt.title('Absolute Change in Daily Temperature in 2017')\n",
    "plt.ylabel('Temperature (F)')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.286847,
     "end_time": "2020-06-23T18:42:29.665508",
     "exception": false,
     "start_time": "2020-06-23T18:42:29.378661",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Finally we apply some smoothing to the data in the form of a rolling/moving average. This is the simplest form of de-noising the data. As we can see from the plots, the average (plotted in blue) roughly traces the sinousoid and is now much smoother. This can improve the accuracy of a regression model trained to predict temperatures within a reasonable margin of error. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:42:30.302485Z",
     "iopub.status.busy": "2020-06-23T18:42:30.301288Z",
     "iopub.status.idle": "2020-06-23T18:42:33.458643Z",
     "shell.execute_reply": "2020-06-23T18:42:33.445922Z"
    },
    "papermill": {
     "duration": 3.513659,
     "end_time": "2020-06-23T18:42:33.458941",
     "exception": false,
     "start_time": "2020-06-23T18:42:29.945282",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "plt.figure(figsize=(15,7))\n",
    "\n",
    "# Plot rolling mean of temperature \n",
    "plt.subplot(1, 2, 1)\n",
    "data['2017'][TEMP_COL].rolling('5D').mean().plot(zorder=2) # Rolling average window is 5 days\n",
    "data['2017'][TEMP_COL].plot(zorder=1)\n",
    "plt.legend(['Rolling','Temp'])\n",
    "plt.title('Rolling Avg in Hourly Temperature in 2017')\n",
    "plt.ylabel('Temperature (F)')\n",
    "\n",
    "# Plot rolling mean of temperature\n",
    "plt.subplot(1, 2, 2)\n",
    "data['2017-01':'2017-03'][TEMP_COL].rolling('2D').mean().plot(zorder=2) # Rolling average window is 2 days\n",
    "data['2017-01':'2017-03'][TEMP_COL].plot(zorder=1)\n",
    "plt.legend(['Rolling','Temp'])\n",
    "plt.title('Rolling Avg in Hourly Temperature in Winter 2017')\n",
    "plt.ylabel('Temperature (F)')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.202051,
     "end_time": "2020-06-23T18:42:33.888267",
     "exception": false,
     "start_time": "2020-06-23T18:42:33.686216",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### Next steps\n",
    "\n",
    "- Close this notebook.\n",
    "- Open the `Part 3 - Time Series Forecasting` notebook to create time-series models to forecast temperatures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.218222,
     "end_time": "2020-06-23T18:42:34.348671",
     "exception": false,
     "start_time": "2020-06-23T18:42:34.130449",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"authors\"></a> \n",
    "### Authors\n",
    "\n",
    "This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).\n",
    "\n",
    "Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "papermill": {
   "duration": 82.979462,
   "end_time": "2020-06-23T18:42:34.718798",
   "environment_variables": {},
   "exception": null,
   "input_path": "Part 2 - Data Analysis.ipynb",
   "output_path": "Part 2 - Data Analysis-output.ipynb",
   "parameters": {},
   "start_time": "2020-06-23T18:41:11.739336",
   "version": "2.1.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
